Final project for the Introduction to Data Science / Text as Data class
By Adelaida Barrera (adelaidabarrera@gmail.com), Natalia Mejía (natimp555@gmail.com), Mariana Saldarriaga (m.saldarriaga15@gmail.com) and Isabel de Brigard (isabeldebrigard@gmail.com)
This semester we seem to be uninterruptedly glued to our screens. From Zoom to R, from Moodle to social media, more and more of our days are spent in front of our phones or computers. But this trend, though exacerbated by the unusual conditions of the last year (ugh, yes, it has almost been a year), did not start with some flying rodent far away. Digital platforms have carved up more and more of our time, and seem to direct more and more of our actions. As policy students, we were interested in one instance where this seems to be happening very notably: the way Twitter communicates, condenses, and shapes public discourse around salient policy issues. Guess it’s not procrastination if you can call it research…
But the question “How does Twitter shape public discourse?” seemed a bit ambitious for finals and the four hours of daylight Berlin offers at the end of the year. So we decided to narrow our focus and explore the public discourse around feminism and gender issues revealed by a select (yes, select as in selected by us, more on this in a minute) group of Twitter accounts of activists, political leaders, writers, and all-around opinion shapers from Colombia. We wanted to flex our newly acquired web scraping and text mining muscles, as well as try our hand at some initial network analysis. It was a bumpy-but-fun road that mostly got us excited to keep at it until we are fully fluent in data science (looking at you, regular expressions).
In what follows you will find, first, a brief section on our sample: how we chose the accounts, what a savage beast our data was at the beginning, how we tried to tame it, and what it looked like when we finally decided to take it for a spin. Then we will get down to business and, with the help of some bi-grams and topic modeling, ask what these accounts actually talk about. We will then attempt to understand how they relate to or differ from one another. You’ll find some scaling and some network analysis. Finally, with the help of sentiment analysis, we will explore how these tweeters feel about a couple of interesting and somewhat controversial topics.
Because we are in academia after all, a few caveats before we begin.
Despite all this, we do think we can draw a couple of interesting conclusions from this first glimpse into our local twitterverse. These are the highlights:
(Placeholder: a couple of bullet points with whatever conclusions are left standing at the end, if any are.)
Our initial intuition was that certain Twitter accounts shape public discourse and that gathering those would give us a balanced and relatively complete picture of what most Twitter talk was about. This is partly the idea behind the Cifras y conceptos opinion leaders panel, which tracks the opinions of various individuals on a wide range of topics. These opinion leaders, they say, “differ from public opinion in general, because they are the ones who guide the climate of opinion, have the capacity for foresight and influence political issues and issues on the national agenda” (source: https://cifrasyconceptos.com/productos-panel-de-opinion/), and so tracing their points of view should tell us about more than their personal stance on a given topic.
So we dived into the twitterverse to see who came out to greet us. With a combination of research, personal experience and some calls to people in the Colombian political sphere we came up with 69 individuals and 39 institutional accounts that we felt had to be included if we were interested in what was being said about feminism and gender issues. This gave us an initial tweet count of upwards of XXX, which seemed like a decent amount of text to begin with.
But yes. We know. This is not a complete, balanced, objective picture of the public discourse on Twitter on these issues. Moreover, there is no way to know from our data how biased or incomplete our sample is. We know. Remember the part about the research grant? Well, the phone still hasn’t rung. But we decided to keep going with what we had. This was our thinking: our agonizing about how bad our selection was only clarified further what our data science professors have been telling us since Stats I: fancy analytical tools only get you so far. If you actually want to be able to say something about the world, you need to work on your theory. Really work on it. But we felt this was an exercise about the tools we had learned. The tools, not the theory. And for that (trying our hand at a limited sample) we had enough. The rest was standard web scraping. Yes, that phrase actually makes sense to us now. We set up our API authorization and scraped away. And what a beautiful mess we got.
In our initial exploration of the data, we looked at the average number of tweets per individual account and the oldest tweet of each account. We then plotted the frequency of tweets over time (left plot). This exploration showed that, because some of the accounts posted content much less frequently than others, the last 3,200 tweets of each account covered very different time periods. We thus limited our data to tweets from the last 6 months and plotted those (right plot). This produced a much more balanced sample, with 67 individual accounts and 116,402 tweets.
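To make the windowing step concrete, here is a minimal Python sketch. It is not the code used in the project: the account names, dates and the 183-day cutoff are invented for illustration. It first finds the oldest tweet per account (which shows how uneven the time coverage is) and then keeps only the tweets inside a common window.

```python
from datetime import datetime, timedelta

def oldest_per_account(tweets):
    """Return the date of the oldest tweet for each account."""
    oldest = {}
    for t in tweets:
        acc = t["account"]
        if acc not in oldest or t["created_at"] < oldest[acc]:
            oldest[acc] = t["created_at"]
    return oldest

def keep_recent(tweets, now, days=183):
    """Keep tweets from roughly the last six months (183 days is an assumption)."""
    cutoff = now - timedelta(days=days)
    return [t for t in tweets if t["created_at"] >= cutoff]

# Invented toy data: one account posts often, the other rarely
tweets = [
    {"account": "frequent_poster", "created_at": datetime(2020, 11, 30)},
    {"account": "frequent_poster", "created_at": datetime(2020, 9, 1)},
    {"account": "rare_poster", "created_at": datetime(2020, 10, 15)},
    {"account": "rare_poster", "created_at": datetime(2017, 3, 2)},
]
now = datetime(2020, 12, 15)
oldest = oldest_per_account(tweets)
recent = keep_recent(tweets, now)
```

In the toy data the rare poster's history reaches back to 2017, so its tweets would otherwise be compared with a very different period than the frequent poster's.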
From this sample, we then removed 22 accounts belonging to congresswomen. We decided to do this after running a topic model on a random sample of 7,000 tweets (which was already a stretch for our 2012 laptops). Although we knew this had implications for what we would be able to say in our analysis later, these accounts had too much content on topics other than gender and feminism, and including them would have made it even harder to get a sense of what public discourse around these issues actually is.
Finally, we restricted the institutional accounts to the same period we had chosen for the individual ones, and ended up with 39 accounts and 25,941 tweets. Here, because the institutions we had chosen are explicitly dedicated to the topics we were interested in, there was no need to leave anyone out. Institutions are, well, more institutional…
With our data ready and the help of quanteda, we created a corpus. Finally, text was data. And so we did what any text miner would do: we got our rags and buckets out, put our aprons on, and began cleaning.
We removed stop words (both those that come with the tm package and some we compiled in our own list), punctuation, numbers, and symbols. Then we removed mentions: we were after the what is what, more than the who is who, of Colombian feminist Twitter. (And we would get to connections later on, with the network analysis.) Next were hashtags. Here, again, we understood that this would limit our analysis somewhat, but we felt we had a solid theory-based reason for it. So we had that going for us, which is nice. The reason is that hashtags tend to work globally, as a shortcut to the apparently borderless internet conversation, and we felt including them might disrupt the picture of the more local discourse we were trying to paint. (I don’t think this is well explained.)
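For readers who like to see the mop and bucket, here is a minimal Python sketch of this kind of cleaning. It is an illustration, not the project's code: the tiny stop-word list and the example tweet are invented, and the regular expressions are one reasonable way (among many) to drop links, mentions, hashtags, numbers and punctuation.

```python
import re
import string

# Hypothetical mini stop-word list, standing in for the real (much longer) lists
STOPWORDS = {"de", "la", "el", "en", "y", "a", "los", "las", "que", "por"}

def clean_tweet(text, stopwords=STOPWORDS):
    """Lowercase a tweet and drop URLs, mentions, hashtags, numbers,
    punctuation and stop words. Returns a list of tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags, tag word included
    text = re.sub(r"\d+", " ", text)           # numbers
    # Replace punctuation (ASCII plus Spanish opening marks) with spaces
    text = re.sub("[" + re.escape(string.punctuation + "¿¡") + "]", " ", text)
    return [tok for tok in text.split() if tok not in stopwords]

example = "¡Las mujeres trans exigen derechos! @usuario #8M https://t.co/abc 2020"
tokens = clean_tweet(example)
print(tokens)
```

Note that the hashtag pattern removes the tag word itself, not just the `#` sign, which matches the decision described above to leave hashtags out entirely.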
And then, we did it all over again for our institutional accounts. By this time, this project was beginning to feel a little like what we figure raising twins must be like: you do a lot of cleaning. And you do it all twice. But we had gotten this far. And we were finally ready to see what all these tweets were about.
Before modeling, we tried to visualize the data we had. We wanted to look at the structure of the data and find relations within the text. Following techniques found online, like the one employed by Orduz (2018), we performed a network analysis. This allows us to represent the text of the tweets graphically as a weighted network.
As a first exploration, we looked at the pairwise co-occurrence of words. We did a bi-gram analysis for individual and institutional accounts. We created the bi-grams and cleaned them (removing stop words, https links, emoticons, irrelevant word pairs, etc.). Afterwards, we defined a weighted network from the bi-gram counts and got our first graphs (for individual and institutional accounts):
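The step from bi-grams to a weighted network can be sketched in a few lines of Python. This is only an illustration with invented toy tweets, assuming the text has already been cleaned and tokenized: each pair of adjacent words becomes an edge, and the number of times the pair appears becomes the edge weight.

```python
from collections import Counter

def bigram_counts(token_lists):
    """Count adjacent word pairs across already-cleaned tweets."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Toy cleaned tweets (invented for illustration)
tweets = [
    ["violencia", "sexual", "acoso", "laboral"],
    ["violencia", "sexual", "mujeres", "trans"],
    ["mujeres", "trans", "derechos", "humanos"],
]
counts = bigram_counts(tweets)

# Each (word1, word2, weight) triple is one weighted edge of the word network
edges = [(w1, w2, n) for (w1, w2), n in counts.most_common()]
for edge in edges[:3]:
    print(edge)
```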
We also added some information to the visualization. We set the sizes of the nodes and the edges by degree and weight, respectively, and used the function strength to get the weighted degree.
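The weighted degree (or strength) of a node is simply the sum of the weights of all the edges that touch it. A minimal Python sketch, with invented edge weights and a hand-written helper that mirrors the idea rather than any particular library function:

```python
from collections import defaultdict

def strength(edges):
    """Weighted degree: sum of the weights of all edges touching each node."""
    total = defaultdict(int)
    for w1, w2, weight in edges:
        total[w1] += weight
        total[w2] += weight
    return dict(total)

# Invented weights for illustration
edges = [
    ("violencia", "sexual", 120),
    ("acoso", "sexual", 80),
    ("mujeres", "trans", 150),
]
node_strength = strength(edges)
print(node_strength["sexual"])
```

A word like "sexual" that appears in several heavy pairs gets a large node in the plot, even if each individual pair is not the heaviest edge.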
Moreover, we extracted the biggest connected component of the network to understand the most frequent conversation among gender public opinion leaders (individuals and institutions) on Twitter. We computed the clusters with a high threshold (100) and also with a lower one (50). The latter gives us a more complex network.
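Here is a rough Python sketch of what thresholding and extracting the biggest connected component involve. The edge weights are invented, and treating the threshold as "keep edges with weight greater than or equal to it" is an assumption made for the example.

```python
from collections import defaultdict, deque

def largest_component(edges, threshold):
    """Keep edges with weight >= threshold, then return the biggest
    connected set of nodes (as a set of words)."""
    graph = defaultdict(set)
    for w1, w2, weight in edges:
        if weight >= threshold:
            graph[w1].add(w2)
            graph[w2].add(w1)
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from an unvisited node
            node = queue.popleft()
            for neighbour in graph[node] - component:
                component.add(neighbour)
                queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Invented weights for illustration
edges = [
    ("violencia", "sexual", 120),
    ("sexual", "acoso", 60),
    ("acoso", "laboral", 55),
    ("mujeres", "trans", 150),
    ("trans", "derechos", 40),
]
big = largest_component(edges, threshold=100)
small = largest_component(edges, threshold=50)
```

With the high threshold only the heaviest pairs survive and the component stays small; lowering the threshold lets weaker links join words into a longer chain, which is why the lower threshold yields a more complex network.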
Finally, following Orduz (2018), we employed the Louvain method for community detection, an algorithm for detecting communities in networks. It evaluates how much more densely connected the nodes within a community are, compared to how connected they would be in a random network (neo4j, December 2020). It recursively merges communities into a single node and runs the modularity clustering on the condensed graphs. We applied the method to check how densely our nodes (words) are connected.
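The quantity the Louvain method tries to maximize is modularity. The Python sketch below does not implement Louvain itself; it only computes the modularity of a partition that is already given, using the standard formula for an undirected weighted graph: for each community, the share of edge weight that falls inside it minus the share expected if edges were placed at random. The toy graph (two triangles joined by one edge) is invented.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected weighted graph.
    `community` maps each node to a community label."""
    m = sum(w for _, _, w in edges)  # total edge weight
    inside, degree = {}, {}
    for u, v, w in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + w  # strength accumulated per community
        degree[cv] = degree.get(cv, 0) + w
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + w
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())

# Two triangles joined by a single edge, all weights equal to 1
edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
    ("c", "d", 1),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles gives a clearly positive modularity, while lumping every node into one community gives zero, which is the sense in which higher values indicate a better partition.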
Among the individual accounts, the word pairs connected with the highest weights within the network are (among others):
This means that gender conversation leaders on Twitter tweet mostly about sexual violence and harassment at work, trans women, human rights, etc. This will give us clues for interpreting our topic model results.
With a high threshold (100), the biggest connected component is not a surprise. However, when we lower the threshold, we see a more complex network of words. It seems that most of the tweets of gender public opinion leaders are built around progressive movements that advocate for the rights of women and, particularly, of trans women who have been killed.
Finally, the community detection results for individual accounts show that four groups were identified and that the modularity (a measure of the “quality” of a given partition of a network’s nodes into clusters) is 0.5 within the biggest connected component of the word network. This result seems neither good nor bad (closer to 1 is better). However, we believe a modularity of 0.5 is a reasonable indication of how densely connected the words in the conversation are.
The following word pairs are the most frequent and relevant among institutions in gender conversations:
The following graphs allow us to conclude that the conversation among institutions which, each in their own way, aim at gender equality in Colombia is mainly about women as victims, specifically rural, young and indigenous women. This seems reasonable, since structural inequalities affect indigenous and rural women the most. Furthermore, gender equality has been discussed in depth in peacebuilding conversations: women are one of the groups most affected by the armed conflict in Colombia.
The community detection results show that 2 groups were identified and the modularity is 0.22. Institutions have a lower modularity than individuals. It seems that the conversation among individuals is organized into more densely connected communities than institutional discourse on Twitter.
Conversations among activists, political leaders, writers, and all-around opinion shapers from Colombia, including private or public institutions and NGOs in defense of women’s rights, are centered on women’s rights. Who would have guessed? What a surprise! Among individual accounts, it seems that workplace harassment and sexual violence are at the center of the conversation. Institutions, in contrast, are more focused on exposing and advocating against the injustices faced by indigenous and rural women.
We got our dfm for individual accounts and turned it into an stm corpus to run our topic model, setting it to find the 10 most prevalent topics in the tweets we had. And yes, we then did it again for the institutional accounts. We broadly identified what the 10 topics were for both the individual and the institutional accounts, although the model does not perfectly classify the documents according to our posterior interpretation. But all in all what we got seemed reasonable.
Our chosen institutions spend a lot of time tweeting about women in the public sphere (well, duh…). They also use Twitter to talk about their institutional events and work, which was also to be expected. And then it gets more interesting. Institutional accounts talk almost as much about social policy as they talk about violence. And both the armed conflict and the truth commission figure prominently.
The model also picked up the conversation about reproductive rights that was sparked by an attempt made in mid-November to re-criminalize the three grounds on which abortion is currently legal in Colombia. In line with the decision of the Constitutional Court, which upheld its 2006 verdict de-criminalizing abortion under certain circumstances, the accounts we chose talk about abortion in terms of rights and access.
Another cluster of discourse formed around the LGBT community and the pandemic. This is probably due to the escalation of police violence against the LGBT community during efforts to enforce the curfews put in place because of the pandemic. But in true institutional spirit, this topic includes more words about dialogue than about accountability.
Individual accounts center on women’s rights, which seems fairly obvious. It is however interesting that discourse here seems focused still on achieving equality with respect to men, which might be an initial indication of how far the debate on gender is in Colombia.
As with the institutional accounts, violence features very prominently, but here it is mostly connected to the state.
Individual accounts also comment often on wider topics of national politics and public opinion, which was perhaps to be expected, but it raised our concerns about how well our model was able to classify the documents.
We got curious about how different occupations might affect the prevalence of these topics in each account. And since we had that information, we went ahead and made more plots:
A couple of interesting outcomes:
Then, with the help of the LDA model, we calculated the probability of each word being generated from each topic (betas) and the ‘per-document-per-topic probabilities’ (gammas): for each document, the proportion of its words that are generated from each topic.
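To see what these two sets of numbers mean, here is a small Python sketch. It does not fit a topic model: it starts from invented (document, word, topic) assignments, of the kind a fitted model implies, and simply normalizes counts. The names beta and gamma follow the usual convention (per-topic word probabilities and per-document topic proportions), and a real model would also apply smoothing that is left out here.

```python
from collections import Counter, defaultdict

# Invented (document, word, topic) assignments, as a fitted model might imply
assignments = [
    ("doc1", "aborto", "rights"),
    ("doc1", "derechos", "rights"),
    ("doc1", "violencia", "violence"),
    ("doc2", "violencia", "violence"),
    ("doc2", "policía", "violence"),
    ("doc2", "derechos", "rights"),
    ("doc2", "violencia", "violence"),
]

def normalise(counter):
    """Turn raw counts into proportions that sum to 1."""
    total = sum(counter.values())
    return {key: n / total for key, n in counter.items()}

word_by_topic = defaultdict(Counter)
topic_by_doc = defaultdict(Counter)
for doc, word, topic in assignments:
    word_by_topic[topic][word] += 1
    topic_by_doc[doc][topic] += 1

# beta: for each topic, a probability distribution over words
beta = {topic: normalise(c) for topic, c in word_by_topic.items()}
# gamma: for each document, the share of its words coming from each topic
gamma = {doc: normalise(c) for doc, c in topic_by_doc.items()}
```

Each topic's betas sum to 1 across words, and each document's gammas sum to 1 across topics, which is what makes them comparable in the plots.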
We plotted the whole thing, inspired by Julia Silge’s blog, which is awesome and which you can check out here: https://juliasilge.com/blog/evaluating-stm/
Some things caught our eye:
After running the LDA model we wanted to see whether the topics change over time, so we plotted the proportion of documents from each topic by week from July to December 2020. We analyzed the events that occurred during this period to get a better understanding of the trends in the data.
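The weekly proportions behind such a plot can be computed as in the Python sketch below. It assumes each document has already been assigned a single topic (for instance, its most probable one), which is an assumption made for the example; the dates and topic labels are invented.

```python
from collections import Counter, defaultdict
from datetime import date

def weekly_topic_share(docs):
    """docs: iterable of (date, topic) pairs.
    Returns {(iso_year, iso_week): {topic: share of that week's documents}}."""
    weekly = defaultdict(Counter)
    for day, topic in docs:
        iso = day.isocalendar()  # (ISO year, ISO week number, weekday)
        weekly[(iso[0], iso[1])][topic] += 1
    return {
        week: {topic: n / sum(c.values()) for topic, n in c.items()}
        for week, c in sorted(weekly.items())
    }

# Invented documents for illustration
docs = [
    (date(2020, 11, 16), "reproductive rights"),
    (date(2020, 11, 17), "reproductive rights"),
    (date(2020, 11, 18), "violence"),
    (date(2020, 11, 25), "violence"),
]
shares = weekly_topic_share(docs)
for week, dist in shares.items():
    print(week, dist)
```

Using shares rather than raw counts keeps a busy week from looking like a change in topic when it is only a change in volume.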
Some things caught our eye:
Now we move to positions! We want to know how each Twitter account lives in space and how the accounts relate to each other. To do so, we scaled some accounts in a common space based on their tweets discussing certain topics. The scale allows us to see how close the accounts are to each other depending on the vocabulary they use regarding a topic. In this case we selected political topics and reproductive health topics.
It is important to mention that, within our selected categories (political topics and reproductive health), the accounts also include content on other topics, which hinders the precision of the analysis. It was not possible to get a clean subset of Twitter accounts that only discussed one topic. This is why we believe the topic model has some limitations with Twitter data, and more manual classification is needed to improve precision.
Looking at the scale based on tweets discussing political topics, we can see that, out of the sample of 25 Twitter accounts, only three sit on the positive side, one at 0 and the rest on the negative side. These results imply that most of the selected accounts use similar vocabulary when discussing political topics, so they are close to each other on the scale.
Lacadavidc, an activist, stands apart from the group at a position beyond -3; this could be due to her polemical positions against the trans women’s movement. In contrast, alejaoficial, a famous artist who advocates against violence against women, sits at almost 1. The scale shows that these two women, even though they both talk about women’s rights, do so using different vocabulary.
What do the extremes of the scale seem to represent?
The scale for institutional accounts discussing reproductive rights shows an interesting grouping. The accounts close to 0 and on the positive side are institutions that advocate for women’s rights: Women_ Equity, Women Commission Colombia, Women Secretary and ONU Women, so these accounts mainly use vocabulary related to women and their reproductive rights. In the middle, between -2 and 0, the spectrum opens up and we find activists’ Twitter accounts. These accounts are close to each other, which means they use similar vocabulary and discuss similar topics, for instance abortion and the LGBT community. The last group is the Constitutional Court and the Truth Commission. We believe they sit at the far left of the scale because they use different language from the other institutions and activists, and they discuss a variety of topics regarding reproductive rights beyond women.
Orduz (2018): https://juanitorduz.github.io/text-mining-networks-and-visualization-plebiscito-tweets/
neo4j (December 2020): https://neo4j.com/docs/graph-algorithms/current/algorithms/louvain/